focal length
MonoUNI: A Unified Vehicle and Infrastructure-side Monocular 3D Object Detection Network with Sufficient Depth Clues
Vehicle-side and infrastructure-side monocular 3D detection are two important topics in autonomous driving. Because sensor installations and focal lengths differ widely between the two settings, researchers have had to build separate algorithms on different prior knowledge. In this paper, by taking into account the diversity of pitch angles and focal lengths, we propose a unified optimization target named normalized depth, which unifies the 3D detection problem across both sides. Furthermore, to enhance the accuracy of monocular 3D detection, we develop a 3D normalized cube depth of obstacles to promote the learning of depth information. We posit that the richness of depth clues is a pivotal factor in detection performance on both the vehicle and infrastructure sides: a richer set of depth clues helps the model learn better spatial knowledge, and the 3D normalized cube depth supplies sufficient clues. Extensive experiments demonstrate the effectiveness of our approach. Without introducing any extra information, our method, named MonoUNI, achieves state-of-the-art performance on five widely used monocular 3D detection benchmarks: Rope3D and DAIR-V2X-I for the infrastructure side, KITTI and Waymo for the vehicle side, and nuScenes for cross-dataset evaluation.
- Research Report > Experimental Study (0.93)
- Research Report > New Finding (0.67)
- South America > Brazil (0.04)
- Asia > China > Beijing > Beijing (0.04)
- North America > United States (0.04)
- Asia > Middle East > Israel (0.04)
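The unified target above rests on a pinhole relation: for a fixed image footprint, apparent depth scales with focal length. Below is a minimal Python sketch of focal-length depth normalization under that assumption; the function names and the reference focal length are illustrative, and MonoUNI's actual target additionally accounts for pitch angle, which is omitted here.

```python
def normalize_depth(depth_m, focal_px, ref_focal_px=1000.0):
    """Map a metric depth label to a focal-length-agnostic target.

    Under a pinhole model, an object of fixed pixel size projects from a
    depth proportional to focal length, so dividing out the per-camera
    focal length yields a label that transfers across sensors.
    ref_focal_px is an arbitrary reference, not a value from the paper.
    """
    return depth_m * ref_focal_px / focal_px

def denormalize_depth(norm_depth, focal_px, ref_focal_px=1000.0):
    """Invert the normalization at inference time for a specific camera."""
    return norm_depth * focal_px / ref_focal_px

# A 40 m label seen through a 2000 px focal length maps to the same
# target as a 20 m label at 1000 px, so one head can serve both cameras.
assert normalize_depth(40.0, 2000.0) == normalize_depth(20.0, 1000.0)
```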
MetaScope: Optics-Driven Neural Network for Ultra-Micro Metalens Endoscopy
Li, Wuyang, Pan, Wentao, Liu, Xiaoyuan, Luo, Zhendong, Li, Chenxin, Liu, Hengyu, Tsai, Din Ping, Chen, Mu Ku, Yuan, Yixuan
Miniaturized endoscopy has advanced accurate visual perception within the human body. Prevailing research remains limited to conventional cameras employing convex lenses, whose millimetre-scale thickness imposes serious impediments on micro-level clinical applications. Recently, with the emergence of meta-optics, ultra-micro imaging based on metalenses (micron-scale) has garnered great attention as a promising solution. However, the physical differences of metalenses leave a large gap in both data acquisition and algorithm research. In light of this, we aim to bridge this unexplored gap and advance novel metalens endoscopy. First, we establish datasets for metalens endoscopy and conduct a preliminary optical simulation, identifying two derived optical issues that physically adhere to strong optical priors. Second, we propose MetaScope, a novel optics-driven neural network tailored for metalens endoscopy. MetaScope comprises two novel designs: Optics-informed Intensity Adjustment (OIA), which rectifies intensity decay by learning optical embeddings, and Optics-informed Chromatic Correction (OCC), which mitigates chromatic aberration by learning spatial deformations informed by learned Point Spread Function (PSF) distributions. To enhance joint learning, we further deploy a gradient-guided distillation to transfer knowledge from a foundation model adaptively. Extensive experiments demonstrate that MetaScope not only outperforms state-of-the-art methods in both metalens segmentation and restoration but also achieves impressive generalization in real biomedical scenes.
- Asia > China > Hong Kong (0.05)
- Asia > China > Guangdong Province > Shenzhen (0.04)
- Europe > Spain > Andalusia > Granada Province > Granada (0.04)
- (4 more...)
- Research Report > Promising Solution (0.68)
- Research Report > New Finding (0.46)
- Health & Medicine > Therapeutic Area (1.00)
- Health & Medicine > Diagnostic Medicine > Imaging (1.00)
- Information Technology > Sensing and Signal Processing > Image Processing (1.00)
- Information Technology > Artificial Intelligence > Vision (1.00)
- Information Technology > Artificial Intelligence > Natural Language (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.93)
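As a rough illustration of the OIA design mentioned in the abstract (not the authors' implementation), the sketch below predicts a per-pixel multiplicative gain from a learned optical embedding concatenated with normalized pixel coordinates; the layer sizes and the residual-gain form are assumptions.

```python
import torch
import torch.nn as nn

class IntensityAdjustment(nn.Module):
    """Toy optics-informed intensity correction: a learned optical
    embedding plus pixel coordinates predicts a spatial gain map that
    counteracts radial intensity decay. Shapes and sizes are guesses."""

    def __init__(self, embed_dim=16):
        super().__init__()
        self.optical_embed = nn.Parameter(torch.zeros(embed_dim))
        self.gain_head = nn.Sequential(
            nn.Linear(embed_dim + 2, 32),  # embedding + (x, y) coordinates
            nn.ReLU(),
            nn.Linear(32, 1),
        )

    def forward(self, img):  # img: (B, C, H, W)
        _, _, h, w = img.shape
        ys, xs = torch.meshgrid(
            torch.linspace(-1, 1, h), torch.linspace(-1, 1, w), indexing="ij"
        )
        coords = torch.stack([xs, ys], dim=-1)                   # (H, W, 2)
        emb = self.optical_embed.expand(h, w, -1)                # (H, W, D)
        gain = self.gain_head(torch.cat([emb, coords], dim=-1))  # (H, W, 1)
        gain = gain.permute(2, 0, 1).unsqueeze(0)                # (1, 1, H, W)
        return img * (1.0 + gain)  # residual multiplicative gain

out = IntensityAdjustment()(torch.rand(2, 3, 64, 64))  # -> (2, 3, 64, 64)
```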
StixelNExT++: Lightweight Monocular Scene Segmentation and Representation for Collective Perception
Vosshans, Marcel, Ait-Aider, Omar, Mezouar, Youcef, Enzweiler, Markus
This paper presents StixelNExT++, a novel approach to scene representation for monocular perception systems. Building on the established Stixel representation, our method infers 3D Stixels and enhances object segmentation by clustering smaller 3D Stixel units. The approach achieves high compression of scene information while remaining adaptable to point cloud and bird's-eye-view representations. Our lightweight neural network, trained on automatically generated LiDAR-based ground truth, achieves real-time performance with computation times as low as 10 ms per frame. Experimental results on the Waymo dataset demonstrate competitive performance within a 30-meter range, highlighting the potential of StixelNExT++ for collective perception in autonomous systems.
- Asia > South Korea (0.14)
- Europe > France > Auvergne-Rhône-Alpes > Puy-de-Dôme > Clermont-Ferrand (0.04)
- South America > Uruguay > Maldonado > Maldonado (0.04)
- (7 more...)
- Transportation > Ground > Road (0.46)
- Transportation > Infrastructure & Services (0.46)
- Information Technology > Artificial Intelligence > Vision (1.00)
- Information Technology > Artificial Intelligence > Robots (1.00)
- Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (1.00)
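The clustering step described above, grouping small 3D Stixel units into objects, could take many forms; the toy function below sketches one plausible greedy variant that chains column-adjacent stixels with similar depth. The (column, depth) tuple format and the 1 m gap threshold are assumptions, not the paper's algorithm.

```python
def cluster_stixels(stixels, depth_gap=1.0):
    """Chain horizontally adjacent (column, depth_m) stixels whose depths
    lie within depth_gap metres into object clusters."""
    if not stixels:
        return []
    ordered = sorted(stixels, key=lambda s: s[0])  # sort by image column
    clusters, current = [], [ordered[0]]
    for s in ordered[1:]:
        prev = current[-1]
        if s[0] - prev[0] <= 1 and abs(s[1] - prev[1]) <= depth_gap:
            current.append(s)         # same object: adjacent and co-depth
        else:
            clusters.append(current)  # column gap or depth jump: new object
            current = [s]
    clusters.append(current)
    return clusters

# Columns 4-5 at ~12 m form one object; column 6 at 30 m starts another
# because of the depth jump.
print(cluster_stixels([(4, 12.1), (5, 12.4), (6, 30.0)]))
```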
CameraBench: Benchmarking Visual Reasoning in MLLMs via Photography
Fang, I-Sheng, Chen, Jun-Cheng
Large language models (LLMs) and multimodal large language models (MLLMs) have significantly advanced artificial intelligence. However, visual reasoning, i.e., reasoning over both visual and textual inputs, remains underexplored. Recent advancements, including reasoning models such as OpenAI o1 and Gemini 2.0 Flash Thinking that accept image inputs, have begun to open up this capability. In this ongoing work, we focus specifically on photography-related tasks because a photo is a visual snapshot of the physical world, where the underlying physics (e.g., illumination, blur extent) interplays with the camera parameters. Reasoning from a photo's visual content to the numerical camera settings that produced it requires an MLLM to grasp the underlying physics precisely, a challenging capability essential for practical applications such as photography assistant agents. We evaluate MLLMs on their ability to distinguish visual differences related to numerical camera settings, extending a methodology previously proposed for vision-language models (VLMs). Our preliminary results demonstrate the importance of visual reasoning in photography-related tasks and show that no single MLLM consistently dominates across all evaluation tasks, highlighting ongoing challenges and opportunities in developing MLLMs with better visual reasoning.
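One way to realize the pairwise evaluation described above is sketched below; query_mllm is a placeholder for whichever model API is under test, and the file names, prompt wording, and pair format are illustrative rather than taken from the benchmark.

```python
# Each pair differs in exactly one numerical camera setting; the model
# must say which photo was taken with the larger value of that setting.
PAIRS = [
    # (image_a, image_b, setting, correct_answer) -- hypothetical data
    ("shot_f1p8.jpg", "shot_f8.jpg", "f-number", "B"),
    ("iso100.jpg", "iso3200.jpg", "ISO", "B"),
]

def query_mllm(images, prompt):
    raise NotImplementedError("plug in the MLLM under evaluation here")

def evaluate(pairs):
    correct = 0
    for img_a, img_b, setting, answer in pairs:
        prompt = (
            f"Photos A and B differ only in {setting}. "
            f"Which photo was taken with the larger {setting}? Answer A or B."
        )
        reply = query_mllm([img_a, img_b], prompt)
        correct += reply.strip().upper().startswith(answer)
    return correct / len(pairs)
```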